debakarr
GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 1 - Data Preprocessing/[R] Data Preprocessing.ipynb
Kernel: R

Data Preprocessing

Import the dataset

dataset = read.csv('Data.csv')
dataset # Unlike Python, indexing in R starts at 1

Taking care of missing data

# Replace each missing Age and Salary value with the mean of its column
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
dataset
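The imputation step above can be tried without Data.csv. The following is a minimal sketch on a made-up data frame; the name `toy`, its values, and the helper `impute_mean` are assumptions for illustration and do not come from the notebook.

```r
# Toy data frame standing in for Data.csv (values are made up for illustration)
toy <- data.frame(
  Age    = c(44, 27, NA, 38),
  Salary = c(72000, NA, 54000, 61000)
)

# Hypothetical helper: replace each NA with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy$Age    <- impute_mean(toy$Age)
toy$Salary <- impute_mean(toy$Salary)
toy
```

Indexing with `is.na(x)` gives the same result as the `ifelse`/`ave` form for a single ungrouped column, and may be easier to read.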

Encoding categorical data

dataset$Country = factor(dataset$Country, levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3))
dataset
dataset$Purchased = factor(dataset$Purchased, levels = c('No', 'Yes'), labels = c(0, 1))
dataset
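One thing worth checking after this step is that the encoded columns are factors, not numbers, even though their labels look numeric. A small sketch with hypothetical sample values (not the contents of Data.csv):

```r
# Hypothetical sample values, not the contents of Data.csv
country <- c('France', 'Spain', 'Germany', 'Spain')

encoded <- factor(country,
                  levels = c('France', 'Spain', 'Germany'),
                  labels = c(1, 2, 3))

class(encoded)   # "factor": the labels are category names, not numbers
levels(encoded)  # "1" "2" "3" (stored as character strings)

# To get plain numbers, convert through character first;
# as.numeric(encoded) alone would return the internal level codes
as_number <- as.numeric(as.character(encoded))
as_number        # 1 2 3 2
```

This is why functions that require numeric input, such as `scale()`, reject these columns later on.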

Splitting the dataset into the Training set and Test set

# install.packages('caTools')
library(caTools)
set.seed(42)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
split # TRUE = Training set, FALSE = Test set
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
training_set
dim(training_set)[1] # number of rows in the training set
test_set
dim(test_set)[1] # number of rows in the test set
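`sample.split` is given the label column because it tries to keep the class proportions similar in both sets. The base R sketch below illustrates that idea on made-up labels. It is not the caTools implementation, and the names `purchased`, `ratio`, and `in_train` are assumptions for illustration.

```r
# Hypothetical labels: 6 'No' and 4 'Yes'
purchased <- factor(c('No', 'Yes', 'No', 'No', 'Yes',
                      'Yes', 'No', 'No', 'Yes', 'No'))

set.seed(42)
ratio <- 0.8

# Within each class, mark round(ratio * n) randomly chosen rows as training rows
in_train <- rep(FALSE, length(purchased))
for (lvl in levels(purchased)) {
  idx <- which(purchased == lvl)
  n_train <- round(ratio * length(idx))
  picked <- idx[sample.int(length(idx), n_train)]
  in_train[picked] <- TRUE
}

# Cross-tabulate class against split membership
table(purchased, in_train)
```

Sampling inside each class separately keeps both 'No' and 'Yes' represented in the training and test sets, which a plain random split over all rows does not guarantee on small data.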

Feature Scaling

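# Scaling the whole data frame fails: Country and Purchased are factors,
# and scale() only accepts numeric columns
training_set = scale(training_set)
test_set = scale(test_set)
Error in colMeans(x, na.rm = TRUE): 'x' must be numeric
Traceback:
1. scale(training_set)
2. scale.default(training_set)
3. colMeans(x, na.rm = TRUE)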
# Scale only columns 2 and 3, which are numeric
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
training_set
test_set
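The cells above scale the training and test sets independently, each with its own mean and standard deviation. A common alternative, which the notebook does not use, is to compute those statistics on the training set only and reuse them for the test set. A minimal sketch on hypothetical numbers (the data frames `train` and `test` are made up for illustration):

```r
# Hypothetical numeric features, not the contents of Data.csv
train <- data.frame(Age = c(30, 40, 50), Salary = c(50000, 60000, 70000))
test  <- data.frame(Age = c(35, 45),     Salary = c(55000, 80000))

# Fit the scaling on the training rows only
train_scaled <- scale(train)
centers <- attr(train_scaled, 'scaled:center')  # column means of train
scales  <- attr(train_scaled, 'scaled:scale')   # column standard deviations of train

# Reuse the training means and standard deviations for the test rows
test_scaled <- scale(test, center = centers, scale = scales)
test_scaled
```

With this approach both sets are expressed in the same units, so a test value equal to the training mean maps to 0 regardless of what else is in the test set.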